Abstract

We present a unified framework which 
supports grounding natural-language 
semantics in robotic driving. This 
framework supports acquisition (learning 
grounded meanings of nouns and prepositions
from human annotation of robotic
driving paths), generation (using such 
acquired meanings to generate sentential 
description of new robotic driving paths), 
and comprehension (using such acquired 
meanings to support automated driving to 
accomplish navigational goals specified 
in natural language). We evaluate the 
performance of these three tasks by 
having independent human judges rate 
the semantic fidelity of the sentences 
associated with paths, achieving overall 
average correctness of 94.6% and overall 
average completeness of 85.6%. 

1 Introduction 

With recent advances in machine perception and robotic automation, it
becomes increasingly relevant and important to allow machines to interact
with humans in natural language in a grounded fashion, where the language
refers to actual things and activities in the world. Here, we present our
efforts to automatically drive, and learn to drive, a mobile robot under
natural-language command. Our contribution is summarized in Fig. 1. A human
teleoperator is given a set of sentential instructions designating robot
paths. The operator then drives a mobile robot under radio control according
to these instructions through a variety of floorplans. The robot uses
onboard odometry and inertial guidance sensors to determine its location in
real time and saves traces of the driving path to log files. From a training
corpus of paths paired with sentential descriptions and floorplan
specifications, our system automatically learns the meanings of nouns that
refer to objects in the floorplan and prepositions that describe both the
spatial relations between floorplan objects and between such objects and the
robot path. With such learned meanings, the robot can then generate
sentential descriptions of new driving activity undertaken by the
teleoperator. Moreover, instead of manually controlling the robot through
teleoperation, one can issue the robot natural-language commands which can
induce fully automatic driving to satisfy the path specified in the
natural-language command.

We have conducted experiments with an actual radio-controlled robot that
demonstrate all three of these modes of operation: acquisition, generation,
and comprehension. We demonstrate successful completion of all three of
these tasks on hundreds of driving examples. We evaluate the fidelity of the
sentential descriptions produced automatically in response to manual driving
and the fidelity of the driving paths induced automatically to fulfill
natural-language commands, by presenting the pairs of sentences together
with the associated paths to human judges. Overall, the average
“correctness” (the degree to which the description is true of the path)
reported is 94.6% and the average “completeness” (the degree to which the
description fully covers the path) reported is 85.6%.

2 Related Work 

We know of no other work which presents a physical robot which learns word
meanings from physical robot paths paired with sentences, uses these learned
meanings to generate sentential descriptions of manually driven paths, and
automatically plans and physically drives paths to satisfy input sentential
descriptions.

[Figure 1: image omitted. Recoverable panel text: preposition labels (left
of, right of, in front of, behind), floorplan object labels (stool, table),
and the three example sentences below.]

  The robot went in front of the bag which is left of the bag then went
  towards the chair.

  The robot went in front of the chair then went away from the chair and
  behind the cone then went right of the bag which is left of the cone then
  went left of the bag which is in front of the cone then went away from the
  cone and away from the chair.

  The robot went behind the bag which is in front of the bag then went in
  front of the bag which is left of the chair then went towards the cone
  then went away from the chair then went right of the chair then went right
  of the bag which is left of the cone.

Figure 1: (left) A human drives the mobile robot through paths according to
sentential instructions while odometry reconstructs the robot’s paths. This
allows the robot to learn the meanings of the nouns and prepositions.
Hand-designed models are shown here for reference; actual learned models are
shown in Fig. 8. Note that the distributions are uniform in velocity angle
(bottom row) for left of, right of, in front of, and behind and in position
angle (top row) for towards and away from. These learned meanings support
generation of English descriptions of new paths driven by teleoperation (top
right) and autonomous driving of paths that meet navigational goals
specified in English descriptions (bottom right).

While there is other work which claims to learn the meanings of words from
robot paths or follow natural-language instructions, upon further inspection
these systems operate only within discrete simulation, as they utilize the
internal representation of the simulation to obtain discrete symbolic
primitives (Tellex et al., 2014, 2011; Kollar et al., 2010; Chen and Mooney,
2011; MacMahon et al., 2006; Koller et al., 2010). Their spaces of possible
robot actions, positions, and states are very small and are represented in
terms of symbolic primitives like TURN LEFT, TURN RIGHT, and MOVE FORWARD N
STEPS (Chen and Mooney, 2011), or DRIVE TO LOCATION 1 and PICK UP PALLET 1
(Tellex et al., 2014). Thus, they take a sequence of primitives like
{DRIVE TO LOCATION 1; PICK UP PALLET 1} and a sentence like “go to the
pallet and pick it up” and learn that the word pallet maps to the primitive
PALLET, that the phrase pick up maps to the primitive PICK UP, and that the
phrase go to X means DRIVE TO LOCATION X.


In contrast, our robot and environment, being in the continuous physical
world, can take an uncountably infinite number of configurations. We take a
set of sentences matched with paths of the robot as input, where the paths
are densely sampled points in the real 2D Cartesian plane. Not all points in
the path correspond to words in the sentences, multiple (often undescribed)
relationships can be true of any point, and the correspondence between
described relationships and path points is unknown. This is a vastly more
difficult problem.


Furthermore, previous work does not even solve the simplified problem
without additional annotation. Kollar et al. (2010) requires hand-drawn
positive and negative paths depicting specific word meanings. Tellex et al.
(2011) requires manual annotation of the groundings of all words in the
training sentences to specific objects and relationships in the training
data. Tellex et al. (2014) does not require annotation of the grounding of
each word, but does require manual temporal segmentation and alignment of
paths and the pieces of multi-part sentences, whereas our method can learn
without any such annotation.

Dobnik et al. (2005) has an actual robot but only learns to classify simple
phrases like A is near B from robot paths paired with such phrases that have
hand-grounded nouns. They can neither generate sentences describing driven
paths, nor automatically drive a path described by a sentence. Our system
can do both of these, as well as learn meanings for both nouns and
prepositions.

3 Our Mobile Robot 

All experiments were performed on a custom mobile robot (Fig. 2). This robot
can be driven by a human teleoperator or drive itself automatically to
accomplish specified navigational goals. During all operation, robot
localization is performed onboard the robot in real time via an Extended
Kalman Filter (Jazwinski, 1970) with odometry from shaft encoders on the
wheels and inertial guidance from an IMU.

Figure 2: Our custom mobile robot.

Due to sensor noise and mechanical factors such as wheel sliding, this
localization is noisy, but generally within 20cm of the actual location. The
video feed, localization, and all sensor and actuator data are logged in a
time-stamped format. When conducting experiments on generation and
acquisition, a human teleoperator drives the robot along a variety of paths
in a variety of floorplans. The path recovered from localization supports
generation and acquisition. When conducting experiments on comprehension,
the path is first planned automatically, then the robot automatically
follows its planned path by comparing the new odometry gathered in real time
with the planned path and controlling the wheels accordingly.

The use of an actual robot with noisy real-world sensor data increases the
difficulty of the tasks when compared to work which occurs in simulation.
The noisy robot position is densely sampled in the continuous domain. For
acquisition and generation, this adds an additional layer of uncertainty, as
the correspondence between individual points in the robot path and the
phrases of a sentence is unknown.


4 Technical Details 


4.1 Grammar and Logical Form 

We employ the grammar shown in Fig. 3 which, while small, supports an
infinite set of possible utterances, unlike the grammars used in Teller et
al. (2010) and Harris et al. (2005). Nothing turns on this, however. In
principle, one could replace this grammar with any other mechanism for
generating logical form. This paper concerns itself with semantics, not
syntax, and only addresses issues relating to the grounding of logical form.
This particular grammar is simply a convenient surface representation of our
logical form.

  S      →  The robot VP
  VP     →  went PPpath [then VP]
  PPpath →  Ppath NP [and PPpath]
  NP     →  the N [PPSR]
  PPSR   →  which is PSR NP [and PPSR]
  Ppath  →  left of | right of | in front of | behind | towards | away from
  PSR    →  left of | right of | in front of | behind
  N      →  bag | box | chair | cone | stool | table

Figure 3: The grammar used by our implementation.

Note that our surface syntax allows two uses of prepositions (and the
associated prepositional phrases): as modifiers to nouns in noun phrases,
indicated with a subscript ‘SR’ (i.e., spatial relation), and as adjuncts to
verbs in verb phrases, indicated with a subscript ‘path.’ Many prepositions
can be used in both SR and path form. They share the same semantic
representation and both uses are learned from the pooled data of both kinds
of occurrences in the training corpus. Furthermore, note that the grammar
supports infinite NP recursion: noun phrases can contain prepositional
phrases that, in turn, contain noun phrases. Finally, note that the grammar
supports conjunctions of prepositional phrases in both SR and path form.
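As an illustration of how the grammar in Fig. 3 produces sentences, the following minimal sketch samples random sentences from it. The function names, the probability `p_opt` of taking an optional branch, and the recursion bound `max_depth` are assumptions introduced here for illustration; the text does not state how random sentences were drawn. The sketch follows the rules of Fig. 3 literally, so a conjoined SR phrase repeats "which is".

```python
import random

P_PATH = ["left of", "right of", "in front of", "behind", "towards", "away from"]
P_SR = ["left of", "right of", "in front of", "behind"]
NOUNS = ["bag", "box", "chair", "cone", "stool", "table"]


def gen_np(rng, p_opt, depth):
    """NP -> the N [PPSR]"""
    words = ["the", rng.choice(NOUNS)]
    if depth > 0 and rng.random() < p_opt:
        words += gen_pp_sr(rng, p_opt, depth - 1)
    return words


def gen_pp_sr(rng, p_opt, depth):
    """PPSR -> which is PSR NP [and PPSR]"""
    words = ["which", "is"] + rng.choice(P_SR).split() + gen_np(rng, p_opt, depth)
    if depth > 0 and rng.random() < p_opt:
        words += ["and"] + gen_pp_sr(rng, p_opt, depth - 1)
    return words


def gen_pp_path(rng, p_opt, depth):
    """PPpath -> Ppath NP [and PPpath]"""
    words = rng.choice(P_PATH).split() + gen_np(rng, p_opt, depth)
    if depth > 0 and rng.random() < p_opt:
        words += ["and"] + gen_pp_path(rng, p_opt, depth - 1)
    return words


def gen_vp(rng, p_opt, depth):
    """VP -> went PPpath [then VP]"""
    words = ["went"] + gen_pp_path(rng, p_opt, depth)
    if depth > 0 and rng.random() < p_opt:
        words += ["then"] + gen_vp(rng, p_opt, depth - 1)
    return words


def gen_sentence(seed=None, p_opt=0.3, max_depth=3):
    """S -> The robot VP; max_depth bounds the recursion so sampling terminates."""
    rng = random.Random(seed)
    return " ".join(["The", "robot"] + gen_vp(rng, p_opt, max_depth)) + "."


print(gen_sentence(seed=1))
```

Setting `p_opt=0.0` yields the shortest sentences the grammar allows (one path prepositional phrase with a simple noun), while larger values exercise the NP recursion and the conjunctions.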

We employ the logical form shown in Fig. 4. Informally, formulas in logical
form denote paths through a floorplan. Both paths and floorplans are
specified as collections of waypoints. A waypoint is a 2D Cartesian
coordinate optionally labeled with the class of the object that resides at
that coordinate, e.g., (3, 47, bag). The waypoint is unlabeled, e.g.,
(3, 47), if no object resides at that coordinate. A floorplan is a set of
labeled waypoints, while a path is a sequence of unlabeled waypoints (Fig. 5
right). A formula in logical form contains three parts: a path quantifier, a
floorplan quantifier, and a condition that the path through the floorplan
must satisfy. The condition is a conjunction of atomic formulas, predicates
applied to variables bound by the path or floorplan quantifiers. The formula
must be closed, i.e., every variable in the condition must appear either in
the path quantifier or the floorplan quantifier. The model of a formula is a
set of bindings for each of the quantified path variables to unlabeled
waypoints, and floorplan variables to labeled waypoints.
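The definitions above can be made concrete with a small data-structure sketch. The class and function names (`Waypoint`, `Atom`, `Formula`, `binding_is_well_typed`) are hypothetical and chosen only for illustration; the sketch checks the closedness condition and the typing of a model's bindings, not whether the spatial predicates actually hold.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Waypoint:
    x: float
    y: float
    label: Optional[str] = None  # class label; None marks an unlabeled (path) waypoint


@dataclass(frozen=True)
class Atom:
    predicate: str         # e.g. "BAG" or "TOWARDS"
    args: Tuple[str, ...]  # one or two variable names


@dataclass(frozen=True)
class Formula:
    path_vars: Tuple[str, ...]
    floorplan_vars: Tuple[str, ...]
    atoms: Tuple[Atom, ...]

    def is_closed(self) -> bool:
        """Every variable in the condition appears in one of the two quantifiers."""
        bound = set(self.path_vars) | set(self.floorplan_vars)
        return all(v in bound for atom in self.atoms for v in atom.args)


def binding_is_well_typed(formula: Formula, binding: Dict[str, Waypoint]) -> bool:
    """Path variables must bind unlabeled waypoints, floorplan variables labeled ones."""
    paths_ok = all(v in binding and binding[v].label is None
                   for v in formula.path_vars)
    floor_ok = all(v in binding and binding[v].label is not None
                   for v in formula.floorplan_vars)
    return paths_ok and floor_ok


# [a]{t} TOWARDS(a, t) AND BAG(t), with the labeled waypoint (3, 47, bag) from the text
formula = Formula(("a",), ("t",), (Atom("TOWARDS", ("a", "t")), Atom("BAG", ("t",))))
binding = {"a": Waypoint(1.0, 2.0), "t": Waypoint(3.0, 47.0, "bag")}
print(formula.is_closed(), binding_is_well_typed(formula, binding))
```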

The one-argument atomic formulas constrain the class of waypoints to which
the variables that appear as their arguments are bound. The two-argument
atomic formulas constrain the spatial relations between pairs of waypoints
to which the variables that appear as their arguments are bound. The logical
form in Fig. 4 contains a particular set of six one-argument predicates and
six two-argument predicates. Nothing turns on this, however. This is simply
the set of predicates that we use in the experiments reported. The framework
clearly extends to any number of predicates of any arity, particularly since
we learn the meanings of the predicates.

  ⟨formula⟩              →  ⟨path quantifier⟩ ⟨floorplan quantifier⟩
                            ⟨atomic formula⟩ (∧ ⟨atomic formula⟩)*
  ⟨path quantifier⟩      →  [⟨var⟩ (, ⟨var⟩)*]
  ⟨floorplan quantifier⟩ →  {⟨var⟩ (, ⟨var⟩)*}
  ⟨atomic formula⟩       →  ⟨atomic formula 1⟩ | ⟨atomic formula 2⟩
  ⟨atomic formula 1⟩     →  BAG(⟨var⟩) | BOX(⟨var⟩) | CHAIR(⟨var⟩)
                          | CONE(⟨var⟩) | STOOL(⟨var⟩) | TABLE(⟨var⟩)
  ⟨atomic formula 2⟩     →  LEFTOF(⟨var⟩, ⟨var⟩) | RIGHTOF(⟨var⟩, ⟨var⟩)
                          | INFRONTOF(⟨var⟩, ⟨var⟩) | BEHIND(⟨var⟩, ⟨var⟩)
                          | TOWARDS(⟨var⟩, ⟨var⟩) | AWAYFROM(⟨var⟩, ⟨var⟩)

Figure 4: The logical form used by our implementation.

[Figure 5: image omitted.]

Figure 5: Sample floorplan with robot path. (left) Extrinsic image taken
during operation. (right) Internal representation of floorplan consisting of
labeled waypoints and localized path consisting of unlabeled waypoints.

Straightforward (semantic) parsing and surface generation techniques map
bidirectionally between the surface language form as specified by the
grammar in Fig. 3 and the logical form in Fig. 4. For example, a surface
form like

  The robot went towards the stool, then went behind the chair which is
  right of the stool, then went towards the cone, then went away from the
  chair which is left of the cone, then went in front of the table.

(commas added for legibility) would correspond to the following logical
form:

\[
[\alpha, \beta, \gamma, \delta, \epsilon]\{t, u, v, w, x, y, z\}
\left(\begin{array}{l}
\mathrm{TOWARDS}(\alpha, t) \wedge \mathrm{STOOL}(t) \wedge {} \\
\mathrm{BEHIND}(\beta, u) \wedge \mathrm{CHAIR}(u) \wedge \mathrm{RIGHTOF}(u, v) \wedge \mathrm{STOOL}(v) \wedge {} \\
\mathrm{TOWARDS}(\gamma, w) \wedge \mathrm{CONE}(w) \wedge {} \\
\mathrm{AWAYFROM}(\delta, x) \wedge \mathrm{CHAIR}(x) \wedge \mathrm{LEFTOF}(x, y) \wedge \mathrm{CONE}(y) \wedge {} \\
\mathrm{INFRONTOF}(\epsilon, z) \wedge \mathrm{TABLE}(z)
\end{array}\right)
\tag{1}
\]


Note that in the above, nouns all correspond to one-argument predicates
while prepositions all correspond to two-argument predicates. But nothing
turns on this. One could imagine lexical prepositional phrases, like
leftward, that correspond to one-argument predicates. Moreover, path uses of
prepositions specify waypoints in the path. These appear in logical form as
predicates whose first argument is a variable in the path quantifier.
Similarly, SR uses of prepositions specify waypoints in the floorplan. These
appear in logical form as predicates whose first argument is a variable in
the floorplan quantifier. Thus, in the above, the atomic formulas
TOWARDS(α, t), BEHIND(β, u), TOWARDS(γ, w), AWAYFROM(δ, x), and
INFRONTOF(ε, z) constitute path uses while the atomic formulas RIGHTOF(u, v)
and LEFTOF(x, y) constitute SR uses. Note that each (path) prepositional
phrase consists of a subset of the atomic formulas in the condition, as
indicated above by the line breaks.
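The mapping from surface form to logical form can be sketched as a small recursive-descent parser. This is an illustrative sketch, not the parser used in the experiments: the class `Parser`, the variable naming scheme (`p0`, `p1`, ... for path variables and `f0`, `f1`, ... for floorplan variables), and the tuple encoding of atomic formulas are all assumptions made here. It follows the grammar literally, so an SR conjunction must be introduced by "and which is".

```python
P_PATH = {"left of": "LEFTOF", "right of": "RIGHTOF", "in front of": "INFRONTOF",
          "behind": "BEHIND", "towards": "TOWARDS", "away from": "AWAYFROM"}
P_SR = {k: P_PATH[k] for k in ("left of", "right of", "in front of", "behind")}
NOUNS = {"bag", "box", "chair", "cone", "stool", "table"}


class Parser:
    def __init__(self, sentence):
        # Commas and the final period carry no meaning in this grammar.
        self.toks = sentence.lower().replace(",", " ").replace(".", " ").split()
        self.pos = 0
        self.path_vars, self.floor_vars = [], []

    def peek(self, *words):
        return tuple(self.toks[self.pos:self.pos + len(words)]) == words

    def expect(self, *words):
        if not self.peek(*words):
            raise ValueError(f"expected {' '.join(words)!r} at token {self.pos}")
        self.pos += len(words)

    def preposition(self, table):
        # Try longer phrases first so multiword prepositions match whole.
        for phrase in sorted(table, key=len, reverse=True):
            if self.peek(*phrase.split()):
                self.pos += len(phrase.split())
                return table[phrase]
        raise ValueError(f"expected a preposition at token {self.pos}")

    def np(self):
        """NP -> the N [PPSR]; returns the new floorplan variable and its atoms."""
        self.expect("the")
        noun = self.toks[self.pos] if self.pos < len(self.toks) else ""
        if noun not in NOUNS:
            raise ValueError(f"expected a noun at token {self.pos}")
        self.pos += 1
        var = f"f{len(self.floor_vars)}"
        self.floor_vars.append(var)
        atoms = [(noun.upper(), var)]
        if self.peek("which", "is"):
            atoms += self.pp_sr(var)
        return var, atoms

    def pp_sr(self, head):
        """PPSR -> which is PSR NP [and PPSR]; the head noun is the first argument."""
        self.expect("which", "is")
        pred = self.preposition(P_SR)
        obj, obj_atoms = self.np()
        atoms = [(pred, head, obj)] + obj_atoms
        if self.peek("and", "which", "is"):
            self.expect("and")
            atoms += self.pp_sr(head)
        return atoms

    def pp_path(self, pvar):
        """PPpath -> Ppath NP [and PPpath]; the path waypoint is the first argument."""
        pred = self.preposition(P_PATH)
        obj, obj_atoms = self.np()
        atoms = [(pred, pvar, obj)] + obj_atoms
        if self.peek("and"):
            self.expect("and")
            atoms += self.pp_path(pvar)
        return atoms

    def sentence(self):
        """S -> The robot VP, with VP -> went PPpath [then VP]."""
        self.expect("the", "robot")
        atoms = []
        while True:
            self.expect("went")
            pvar = f"p{len(self.path_vars)}"
            self.path_vars.append(pvar)
            atoms += self.pp_path(pvar)
            if not self.peek("then"):
                break
            self.expect("then")
        if self.pos != len(self.toks):
            raise ValueError(f"unexpected token at {self.pos}")
        return self.path_vars, self.floor_vars, atoms


def parse(sentence):
    return Parser(sentence).sentence()
```

Applied to the example sentence above, `parse` yields five path variables, seven floorplan variables, and the fourteen atomic formulas of Eq. 1 in the same order, up to variable names.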


4.2 Representation of the Lexicon 


The lexicon specifies the meanings of the one- and two-argument predicates
in logical form. The meanings of one-argument predicates are discrete
distributions over the set of class labels. Note that the one-argument
predicates, like BAG, are distinct from the class labels, like bag. The
mapping between such is learned. Moreover, a given floorplan might have
multiple instances of objects of the same class. These would be
disambiguated with complex noun phrases such as the chair which is right of
the stool and the chair which is left of the cone. Such disambiguating
prepositional phrase modifiers of noun phrases can be nested and conjoined
arbitrarily. Similarly, waypoints can be disambiguated by conjunctions of
prepositional phrase adjuncts.

Two-argument predicates specify relations between target objects and
reference objects. In SR uses, the reference object is the object of the
preposition while the target object is the head noun. For example, in the
chair to the left of the table, chair is the target object and table is the
reference object. In path uses, the target object is a waypoint in the robot
path while the reference object is the object of the preposition. For
example, in went towards the table, table is the reference object. The
lexical entry for each two-argument predicate is specified as the location μ
and concentration κ parameters for multiple independent von Mises
distributions (Abramowitz and Stegun, 1972) for a variety of angles between
target and reference objects.

The meanings of two-argument predicates are specified as a pair of von Mises
distributions on angles. One, the position angle, is the orientation of a
vector from the coordinates of the reference object to the coordinates of
the target object (Fig. 6 left).¹ The same distribution is used both for SR
and path uses. The second, the velocity angle, is the angle between the
velocity vector at a waypoint and a vector from the coordinates of the
waypoint to the coordinates of the reference object (Fig. 6 right). This is
only used for path uses, because it requires computation of the direction of
robot motion which is determined from adjacent waypoints in the path. This
angle is thus taken from the frame of reference of the robot.

Figure 6: (left) How position angles are measured. (right) How velocity
angles are measured.

Fig. 1 (bottom left) illustrates how this framework is used to represent the
meanings of prepositions. Here, we render the angular distributions as
potential fields around the reference object at the center for the position
angle, and the target object at the center for the velocity angle. The
intensity of a point (target object for position angle) reflects its
probability mass. Note that the distributions are uniform in velocity angle
for left of, right of, in front of, and behind and in position angle for
towards and away from.
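The two angles and the von Mises density can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the velocity is estimated from the previous waypoint only, and the sign convention and the wrapping of the velocity angle to (-π, π] are choices made here rather than details given in the text. The density is the standard von Mises form exp(κ cos(θ − μ)) / (2π I₀(κ)), with I₀ evaluated by its power series.

```python
import math


def bessel_i0(x, terms=50):
    """Modified Bessel function of the first kind, order 0, via its power series."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (x / 2.0) ** 2 / ((k + 1) ** 2)
    return total


def von_mises_pdf(theta, mu, kappa):
    """Density of the von Mises distribution with location mu and concentration kappa."""
    return math.exp(kappa * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(kappa))


def position_angle(reference, target):
    """Orientation of the vector from the reference object to the target object."""
    return math.atan2(target[1] - reference[1], target[0] - reference[0])


def velocity_angle(prev_wp, wp, reference):
    """Angle between the velocity at wp and the vector from wp to the reference.

    The velocity direction is estimated from the previous waypoint (an assumption).
    """
    heading = math.atan2(wp[1] - prev_wp[1], wp[0] - prev_wp[0])
    bearing = math.atan2(reference[1] - wp[1], reference[0] - wp[0])
    diff = bearing - heading
    return math.atan2(math.sin(diff), math.cos(diff))  # wrap into (-pi, pi]


# Driving straight at an object gives a velocity angle of 0; driving straight
# away from it gives an angle of magnitude pi.
print(velocity_angle((0.0, 0.0), (1.0, 0.0), (2.0, 0.0)))
print(abs(velocity_angle((1.0, 0.0), (2.0, 0.0), (0.0, 0.0))))
```

With κ = 0 the density reduces to the uniform value 1/(2π), which is how a distribution that is "uniform in velocity angle" or "uniform in position angle" can be represented in this parameterization.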

4.3 Tasks 

We formulate sentential semantics as a variety of relationships between a
sentence s (or more precisely, a formula in logical form), a path p (a
sequence of unlabeled waypoints), a floorplan f (a set of labeled
waypoints), and a lexicon Λ (the collective μ and κ parameters for the
angular distributions for each of the two-argument predicates and the
discrete distributions for each of the one-argument predicates).

acquisition    Learn a lexicon Λ from a collection of observed paths taken
               by the robot in the corresponding floorplans as described by
               human-generated sentences s_i.
generation     Generate a sentence s that describes an observed path p taken
               by the robot in a given floorplan f with a known lexicon Λ.
comprehension  Generate a path p to be taken by the robot that satisfies a
               given sentence s issued as a command in a given floorplan f
               with a known lexicon Λ.

¹ Without loss of generality, angles are measured in the frame of reference
of the robot prior to the beginning of action, which is taken to be the
origin.

4.3.1 Acquisition 

To perform acquisition, we formulate a large hidden Markov model (HMM), with
a state k for every path prepositional phrase PPpath,k in each sentence in
the training corpus. The observations for this HMM are the sequences of path
waypoints in the training corpus. Each state’s output model sums over all
mappings m between object references in PPpath,k and floorplan waypoints.
Given such a mapping, the output model for a state k consists of the product
of the probabilities P determined by each atomic formula i in the logical
form derived from PPpath,k, given the probability models for the predicates
as specified by the current estimates of the parameters in Λ:

\[
\mathcal{R}(\mathrm{PP}_{\mathrm{path},k}, \mathbf{p}, \mathbf{f}, \Lambda)
  = \sum_{m} \prod_{i} P\left(w_{o_{i1}}, w_{o_{i2}}, \ldots \mid \Lambda\right)
\tag{2}
\]

where w is the set of all path and floorplan waypoints, and where o_ij is
the index in w of the jth argument of the ith atomic formula.

The transition matrix for the HMM is constructed from the sentences in the
training corpus to allow each state only to self loop or to transition to
the state for the next path prepositional phrase in the training sentence.
The HMM is constrained to start in the state associated with the first path
prepositional phrase in the sentence associated with each path. We add dummy
states, with a small fixed output probability, between the states for each
pair of adjacent path prepositional phrases, as well as at the beginning and
end of each sentence, to allow for portions of the path that are not
described in the associated sentence. We then train this HMM with Baum-Welch
(Baum and Petrie, 1966; Baum et al., 1970; Baum, 1972). This trains the
distributions for the words in the lexicon Λ as they are tied as components
of the output models. Specifically, it infers the latent alignment between
the noisy robot path waypoints and the phrases in the training data while
simultaneously updating the meanings of the words to match the relationships
between waypoints described in the corpus. In this way, the meanings of both
the nouns and the prepositions are learned.
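The left-to-right structure with interleaved dummy states can be sketched for a single training sentence as follows. This is an illustrative sketch only: the state encoding, the function names, and the uniform initial probabilities over allowed successors are assumptions introduced here, as is the choice to let a phrase state skip directly over the following dummy state. Entries initialized to zero remain zero under Baum-Welch re-estimation, so the structure is preserved during training.

```python
def build_states(num_phrases):
    """Dummy state, then alternating phrase and dummy states: D0 S1 D1 ... Sn Dn."""
    states = [("dummy", 0)]
    for k in range(1, num_phrases + 1):
        states.append(("phrase", k))
        states.append(("dummy", k))
    return states


def build_transitions(num_phrases):
    """Transition matrix that only allows self loops and forward moves."""
    states = build_states(num_phrases)
    n = len(states)
    matrix = [[0.0] * n for _ in range(n)]
    for i, (kind, _) in enumerate(states):
        successors = [i]                      # self loop
        if i + 1 < n:
            successors.append(i + 1)          # next state in the chain
        if kind == "phrase" and i + 2 < n:
            successors.append(i + 2)          # skip the dummy between two phrases
        for j in successors:
            matrix[i][j] = 1.0 / len(successors)  # uniform start (assumption)
    return states, matrix


states, matrix = build_transitions(2)
for state, row in zip(states, matrix):
    print(state, [round(v, 2) for v in row])
```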















Figure 7: Illustration of the generation algorithm. A disambiguating noun
phrase is generated for each floorplan waypoint. Path waypoints are
described by prepositional phrases, and then sets of identical phrases are
merged into intervals, which are combined to form the sentence.

4.3.2 Generation 

Language generation takes as input a path p obtained by odometry during
human teleoperation of the robot. This path consists of a collection of 2D
floor positions sampled at 50Hz. To generate a formula in logical form, and
thus the corresponding sentence, one must select a subsequence of this dense
sequence worthy of description.

During generation, we care about three properties: “correctness,” that the
sentence be logically true of the path, “completeness,” that the sentence
differentiate the intended path from all other possible paths, and
“conciseness,” that the sentence be the shortest that does so. We attempt to
find a balance between these properties with the following heuristic
algorithm (Fig. 7). First, we sample path waypoints in a way that the
sampled points evenly distribute along the path. To this end, we downsample
the path by computing the integral distance traveled from the initial
position for each point in the dense path and selecting a subsequence whose
points are separated by 5cm of integral path length. We then produce a path
prepositional phrase to describe each path waypoint by selecting that atomic
formula with maximum posterior probability constructed out of a two-argument
predicate with the path waypoint as its first argument and with a floorplan
waypoint as its second argument. Identical such choices for consecutive sets
of waypoints in the path are coalesced and short intervals of path
prepositional phrases are discarded. We then generate a noun phrase for the
object of each waypoint preposition that refers to that referenced floorplan
waypoint. We take a one-argument predicate to be true of that class with
maximum posterior probability and false of all others. Similarly, for each
pair of floorplan waypoints, we take that two-argument predicate with
maximum posterior probability to be true of that tuple and all other
predicates applied to that tuple to be false. Thus when the floorplan
contains a single instance of a class, it can be referred to with a simple
noun. But when there are multiple instances of a class, the shortest
possible noun phrase, with one or more SR prepositional phrases, is
generated to disambiguate.
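The downsampling and coalescing steps of this heuristic can be sketched as follows. The function names and the `min_run` threshold for what counts as a short interval are assumptions made for illustration; the text gives the 5cm spacing but not the interval threshold. The sketch does not re-merge identical phrases that become adjacent after a short interval is discarded.

```python
import math
from itertools import groupby


def downsample_by_arc_length(path, spacing):
    """Keep the first point, then one point per `spacing` of cumulative path length."""
    if not path:
        return []
    kept = [path[0]]
    travelled, next_mark = 0.0, spacing
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        travelled += math.hypot(x1 - x0, y1 - y0)
        if travelled >= next_mark:
            kept.append((x1, y1))
            while next_mark <= travelled:   # step past marks skipped by a long segment
                next_mark += spacing
    return kept


def coalesce(phrases, min_run=2):
    """Merge runs of identical per-waypoint phrases and drop runs shorter than min_run."""
    runs = [(phrase, len(list(group))) for phrase, group in groupby(phrases)]
    return [phrase for phrase, length in runs if length >= min_run]


# A straight 20cm path sampled every 1cm, downsampled to 5cm spacing.
dense = [(float(i), 0.0) for i in range(21)]
print(downsample_by_arc_length(dense, 5.0))
print(coalesce(["left of the bag"] * 3 + ["behind the cone"] + ["towards the chair"] * 2))
```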

More formally, let c(e) be the class name of the object at the floorplan
waypoint e. For each pair of floorplan waypoints (e, e′), there exists only
one two-argument spatial-relation predicate φ that is true of this tuple.
Let d(e) be the noun phrase we want to generate to disambiguate the
floorplan waypoint e from others e′. Then e can be referred to with d(e)
unambiguously if (a) d(e) = (c(e), {}) is unique; or (b) there exists a
collection of two-argument predicates {φ_n(e, e_n)} such that the formula
d(e) = (c(e), {(φ_n, d(e_n))}) is unique. To produce a concise sentence, we
want the size of the collection of two-argument predicates in step (b) above
to be as small as possible. However, finding the smallest collection of
modifiers is NP-hard (Dale and Reiter, 1995). To avoid exhaustive search, we
use a greedy heuristic that biases towards adding the least frequent pairs
(φ_n, d(e_n)) into the collection until d(e) is unique. This results in a
tractable polynomial-time algorithm. After we get d(e), we turn it into a
noun phrase by simple realization, for example:

  (TABLE, {(LEFT-OF, CHAIR), (BEHIND, TABLE)})
      ↓
  the table which is left of the chair and behind the table
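The greedy heuristic can be sketched under simplifying assumptions. Here each modifier refers to the bare class of the reference object rather than to a nested description d(e_n), and the helper `axis_relation` is a hypothetical stand-in for the learned maximum-posterior relation between two floorplan waypoints. All names are introduced for illustration only.

```python
PHRASE = {"LEFTOF": "left of", "RIGHTOF": "right of",
          "INFRONTOF": "in front of", "BEHIND": "behind"}


def axis_relation(a, b):
    """Hypothetical stand-in for the single relation taken to hold of (a, b)."""
    dx, dy = a[0] - b[0], a[1] - b[1]
    if abs(dx) >= abs(dy):
        return "LEFTOF" if dx < 0 else "RIGHTOF"
    return "INFRONTOF" if dy < 0 else "BEHIND"


def satisfies(obj, cls, modifiers, classes, relation):
    """True if obj has class cls and every (predicate, class) modifier holds of it."""
    if classes[obj] != cls:
        return False
    return all(
        any(o != obj and relation[(obj, o)] == pred and classes[o] == ref_cls
            for o in classes)
        for pred, ref_cls in modifiers)


def describe(target, classes, relation):
    """Greedily add the modifier shared by the fewest rivals until target is unique."""
    cls, modifiers = classes[target], []
    candidates = sorted({(relation[(target, o)], classes[o])
                         for o in classes if o != target})
    while True:
        rivals = [o for o in classes
                  if o != target and satisfies(o, cls, modifiers, classes, relation)]
        if not rivals or not candidates:
            break
        shared = lambda c: sum(satisfies(o, cls, [c], classes, relation) for o in rivals)
        best = min(candidates, key=shared)
        if shared(best) == len(rivals):
            break                      # no remaining modifier rules out any rival
        candidates.remove(best)
        modifiers.append(best)
    return cls, modifiers


def realize(cls, modifiers):
    text = "the " + cls
    if modifiers:
        text += " which is " + " and ".join(f"{PHRASE[p]} the {c}" for p, c in modifiers)
    return text


positions = {"c1": (0, 0), "c2": (4, 0), "s": (2, 0), "k": (2, 5)}
classes = {"c1": "chair", "c2": "chair", "s": "stool", "k": "cone"}
relation = {(a, b): axis_relation(positions[a], positions[b])
            for a in positions for b in positions if a != b}
print(realize(*describe("s", classes, relation)))
print(realize(*describe("c1", classes, relation)))
```

In this toy floorplan the stool is unique and needs no modifier, while each chair is distinguished by a single SR phrase, matching the style of noun phrases such as "the bag which is left of the bag" that appear in the example sentences.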

4.3.3 Comprehension 

To perform comprehension, we use gradient ascent to optimize the scoring
function with respect to an unknown path p

\[
\mathbf{p}^{*} = \arg\max_{\mathbf{p}} \mathcal{R}(\mathbf{s}, \mathbf{p}, \mathbf{f}, \Lambda)
\]

where ℛ(s, p, f, Λ) is the product of all P from Eq. 2. We are computing a
MAP estimate of the joint probability of satisfying the conjunction of
atomic formulas assuming that they are independent.

The above scoring function alone is insufficient. It represents the strict
meaning of the sentence, but does not take into account constraints of the
world, such as the need to avoid collision with the objects in the
floorplan. It can also be difficult to optimize because the cost associated
with the relative orientation between two waypoints becomes increasingly
sensitive to small changes in position as they become closer together. To
remedy the problems of the path waypoints getting too close to objects and
to each other, a barrier penalty term is added between each pair of a path
waypoint and floorplan waypoint as well as between pairs of temporally
adjacent path waypoints to prevent them from becoming too close. This term
is 1 until the distance between the two waypoints becomes less than a
threshold, at which point it decreases rapidly. Finally, our formulation of
the semantics of prepositions is based on angles but not distance. Thus
there is a large subspace of the floor that leads to equal probability of
satisfying each atomic formula, i.e., the cones in Fig. 1. This allows a
path to satisfy a prepositional phrase like to the left of the chair by
being far away from the chair. To remedy this, we add a small attraction
between each path waypoint and the floorplan waypoints selected as its
reference objects to prefer short distances. A postprocessing step performs
obstacle avoidance by adding additional path waypoints as needed.
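The interaction of the angular term, the barrier penalty, and the attraction can be sketched for a single path waypoint and a single reference object. Everything concrete here is an assumption made for illustration: the log-domain objective, the Gaussian-shaped barrier below the threshold, the linear attraction, the numeric constants, the finite-difference gradient, and the choice μ = π for a "left of" position angle. The barrier between temporally adjacent path waypoints is omitted because only one waypoint is optimized.

```python
import math


def log_score(p, ref, mu, kappa, threshold=0.5, sharpness=10.0, attraction=0.1):
    """Log of (unnormalized von Mises on the position angle) x barrier x attraction."""
    dx, dy = p[0] - ref[0], p[1] - ref[1]
    dist = math.hypot(dx, dy)
    angle_term = kappa * math.cos(math.atan2(dy, dx) - mu)
    gap = max(0.0, threshold - dist)          # zero beyond the threshold: barrier is 1
    barrier_term = -sharpness * gap * gap     # falls off rapidly inside the threshold
    attraction_term = -attraction * dist      # mild preference for short distances
    return angle_term + barrier_term + attraction_term


def gradient_ascent(p, ref, mu, kappa, lr=0.02, steps=4000, eps=1e-5):
    """Plain gradient ascent with central finite differences."""
    x, y = p
    for _ in range(steps):
        gx = (log_score((x + eps, y), ref, mu, kappa)
              - log_score((x - eps, y), ref, mu, kappa)) / (2 * eps)
        gy = (log_score((x, y + eps), ref, mu, kappa)
              - log_score((x, y - eps), ref, mu, kappa)) / (2 * eps)
        x, y = x + lr * gx, y + lr * gy
    return x, y


chair = (0.0, 0.0)
start = (3.0, 1.0)   # begins on the wrong side of the chair
end = gradient_ascent(start, chair, mu=math.pi, kappa=4.0)
print(end, math.hypot(end[0] - chair[0], end[1] - chair[1]))
```

Without the attraction term the waypoint would settle anywhere along the preferred direction, however far from the chair; without the barrier it would be pulled onto the chair itself. Together they place it just outside the threshold distance.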

5 Experiments 

We conducted an experiment as outlined in Fig. 8. We generated 250 random
sentences from the grammar in Fig. 3, 25 in each of 10 different floorplans
that were randomly generated to place either 4 or 5 objects, with 2 objects
always being of the same class (to introduce ambiguity requiring
disambiguation via SR prepositional phrases), at one of 12 possible grid
positions. Path data was logged while a human teleoperator manually drove
the robot to comply with these sentential instructions in these floorplans
(Fig. 8 top). Models were learned for each of the nouns and prepositions.
These were used to automatically generate descriptions for 10 different new
paths manually driven by a human teleoperator in 10 new random floorplans
(Fig. 8 middle). These were also used to automatically drive the robot to
follow 10 different new random sentences in each of 10 different new random
floorplans where the same objects could be placed at one of 56 possible grid
positions (Fig. 8 bottom). The random sentences used for training had either
2 or 3 path waypoints while those used for generation and comprehension had
either 5 or 6 path waypoints.

                                        correctness         completeness
                                        mean     std dev    mean     std dev
  generation (hand-constructed models)  94.6%    4.54%      85.5%    2.26%
  generation (learned models)           92.0%    6.11%      84.2%    6.35%
  comprehension (planned path)          96.2%    0.38%      88.5%    11.5%
  comprehension (measured path)         95.5%    1.42%      84.7%    9.9%

[Figure 9 (bottom): conciseness histograms; image omitted.]

Figure 9: Correctness, completeness, and conciseness results of human
evaluation of sentences automatically generated from manually driven paths
and automatically driven paths produced by comprehension of provided
sentences.

Odometry and inertial guidance were used to determine paths driven. Pairs of
sentences and paths obtained during both generation and comprehension were
given to a pool of 6 independent judges to obtain 3 judgments on each.
Judges were asked to label each path prepositional phrase in each sentence
paired with the entire path as being either ‘correct’ or ‘incorrect’, i.e.,
whether it was true of the intended portion of the path as determined by
that judge. For generation, judges were also asked to assess how much of the
path was described by the sentence, giving a completeness judgment ranging
from 0 (worst) to 5 (best). These were converted to percentages. For
comprehension, judges were also asked to assess what fraction of the path
constitutes motion that is described by the sentence (quantized as 0 to 5).
These were again converted to percentages to measure completeness. For
generation, judgments were obtained twice, pairing each input path with
sentences generated using the hand-constructed models from Fig. 1 as well as
the learned models from Fig. 8. For comprehension, judgments were also
obtained twice, pairing each input sentence with both the planned path as
well as the actually driven path as determined by odometry and inertial
guidance. Fig. 9 (top) summarizes the judgments aggregated across the 3
judges and 100 samples. The standard deviations are across the mean value of
the 3 judges for each sample. Overall, the average “correctness” reported is
94.6% and the average “completeness” reported is 85.6%.
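The aggregation of judgments described above can be sketched as follows. The function names are hypothetical, the per-sentence correctness is assumed here to be the fraction of path prepositional phrases labeled correct, and the use of the population standard deviation is also an assumption; the text specifies only that the standard deviation is taken across the per-sample means of the 3 judges.

```python
from statistics import mean, pstdev


def completeness_percent(score):
    """Convert a 0 (worst) to 5 (best) completeness judgment to a percentage."""
    return 100.0 * score / 5.0


def correctness_percent(labels):
    """Percentage of path prepositional phrases a judge labeled 'correct'."""
    return 100.0 * sum(labels) / len(labels)


def summarize(per_sample_judgments):
    """Mean and standard deviation across samples of the per-sample judge means.

    per_sample_judgments: one list per sample, holding one percentage per judge.
    """
    sample_means = [mean(judgments) for judgments in per_sample_judgments]
    return mean(sample_means), pstdev(sample_means)


# Two made-up samples, each scored by three judges (illustrative values only).
samples = [[completeness_percent(s) for s in (5, 4, 3)],
           [completeness_percent(s) for s in (4, 4, 4)]]
print(summarize(samples))
```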

For generation, we also measured “conciseness” by having the 3 human judges
score each generated sentence as -2 (much too short), -1 (too short), 0
(about right), 1 (too long), or 2 (much too long). Fig. 9 (bottom)
summarizes these judgments as histograms. Overall, judges assessed that the
generated sentence length was ‘about right’ a little over half of the time,
with generation erring more towards being too long than too short.

[Figure 8: images omitted. Sentences shown in the figure panels:]

  The robot went behind the cone then went away from the cone then went
  behind the cone then went behind the bag.

  The robot went behind the cone then went in front of the stool then went
  in front of the stool then went right of the box which is left of the box
  then went left of the cone then went in front of the box which is right of
  the box then went in front of the box which is left of the box.

  The robot went in front of the table then went right of the table then
  went behind the table then went left of the table then went right of the
  cone then went in front of the cone then went left of the cone.

  The robot went left of the bag then went behind the chair which is right
  of the chair then went behind the chair which is left of the chair then
  went left of the chair which is left of the chair then went in front of
  the chair which is left of the chair then went in front of the chair which
  is right of the chair.

  The robot went in front of the stool then went right of the chair which is
  right of the bag then went in front of the chair which is right of the bag
  then went in front of the bag then went left of the bag then went behind
  the bag then went away from the bag then went left of the stool then went
  in front of the stool then went right of the chair which is right of the
  bag.

  The robot went behind the bag then went left of the bag then went in front
  of the bag then went in front of the cone then went behind the cone then
  went behind the bag then went behind the table.

  The robot went left of the stool then went towards the cone then went
  behind the table which is right of the bag then went in front of the
  stool.

  The robot went towards the bag then went away from the table then went in
  front of the box then went towards the chair.

  The robot went towards the bag then went towards the stool then went
  towards the table which is left of the stool then went in front of the
  bag.

  The robot went away from the table which is behind the box then went right
  of the stool then went right of the table which is behind the box then
  went towards the table which is left of the box.

  The robot went towards the bag which is left of the stool then went
  towards the table then went behind the table then went left of the bag
  which is left of the stool.

  The robot went in front of the chair then went in front of the box which
  is right of the box then went behind the box which is right of the box
  then went towards the box which is left of the box.

Figure 8: Example experimental runs, 6 out of 250 for acquisition and 100
for each of generation and comprehension. Videos available at
http://drivingundertheinfluenceoflanguage.blogspot.com.

6 Conclusion 

We demonstrate a novel approach for grounding the semantics of natural
language in the domain of robot navigation. Sentences describe paths taken
by the robot relative to other objects in the environment. The meanings of
nouns and prepositions are trained from a corpus of paths driven by a human
teleoperator annotated with sentential descriptions. These can then support
both automatic generation of sentential descriptions of new paths driven as
well as automatic driving of paths to satisfy navigational goals specified
in provided sentences. This is a step towards the ultimate goal of grounded
natural language that allows machines to interact with humans when the
language refers to actual things and activities in the real world.